A Layered Locality Sensitive Hashing based Sequence Similarity Search Algorithm for Web Sessions

Authors

  • Angana Chakraborty
  • Sanghamitra Bandyopadhyay
Abstract

In this article we propose a Layered Locality Sensitive Hashing algorithm to perform similarity search on web log sequence data. Locality Sensitive Hashing has been found to be an efficient technique for approximate nearest neighbor search over large databases, as its dependence on the data size is sub-linear even in high dimensions. Mining large web logs to provide customised services to users is one area where similar sessions must be extracted quickly. The variety of session lengths adds extra complexity to this problem. To tackle this dimension variability, the concept of layering is introduced in locality sensitive hashing, and a recently proposed web page similarity measure, Psim, is used. The proposed method is referred to as the Layered Locality Sensitive Hashing based Sequence Similarity Search Algorithm, or LaLSA for short. The similarity at the session level is computed using a fast sequence alignment technique, FOGSAA. On the NASA and ClarkNet web log datasets, LaLSA achieves an average time gain of 81.88% with 97.2% accurate results when compared to the exact algorithm. Therefore, LaLSA is a time-efficient solution for similarity search between variable-length sequences, with outputs almost as good as the exact ones.
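The general idea of combining locality sensitive hashing with layers for variable-length sessions can be illustrated with a minimal sketch. This is not the LaLSA algorithm itself: the abstract does not specify how layers are defined, so the sketch assumes that a layer is simply a session-length range (`layer_width` is a hypothetical parameter), and it substitutes MinHash over the set of visited pages for the Psim-based hashing. The class and function names are illustrative only.

```python
import hashlib
from collections import defaultdict


def _hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of an item under a given seed."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(pages, num_hashes=16):
    """MinHash signature of the set of pages visited in a session."""
    unique = set(pages)
    if not unique:
        raise ValueError("a session must contain at least one page")
    return tuple(min(_hash(seed, p) for p in unique) for seed in range(num_hashes))


class LayeredLSHIndex:
    """Hypothetical layered index: LSH buckets are kept separately per length layer."""

    def __init__(self, layer_width=5, num_hashes=16, bands=8):
        if num_hashes % bands != 0:
            raise ValueError("num_hashes must be divisible by bands")
        self.layer_width = layer_width  # assumed layer definition: length range
        self.num_hashes = num_hashes
        self.bands = bands
        self.rows = num_hashes // bands
        # (layer, band index, band values) -> set of session ids
        self.buckets = defaultdict(set)

    def _layer(self, pages):
        return len(pages) // self.layer_width

    def _band_keys(self, layer, signature):
        for b in range(self.bands):
            yield (layer, b, signature[b * self.rows:(b + 1) * self.rows])

    def add(self, session_id, pages):
        sig = minhash_signature(pages, self.num_hashes)
        for key in self._band_keys(self._layer(pages), sig):
            self.buckets[key].add(session_id)

    def candidates(self, pages):
        """Sessions sharing at least one band with the query, in nearby layers only."""
        sig = minhash_signature(pages, self.num_hashes)
        layer = self._layer(pages)
        found = set()
        for neighbour in (layer - 1, layer, layer + 1):
            for key in self._band_keys(neighbour, sig):
                found |= self.buckets.get(key, set())
        return found
```

In a pipeline of the kind the abstract describes, the candidate set returned here would then be re-ranked with an exact session-level score (the abstract names FOGSAA for that step), so the hashing stage only has to narrow the search, not produce the final ordering.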


Similar resources

Scalable Locality-Sensitive Hashing for Similarity Search in High-Dimensional, Large-Scale Multimedia Datasets

Similarity search is critical for many database applications, including the increasingly popular online services for Content-Based Multimedia Retrieval (CBMR). These services, which include image search engines, must handle an overwhelming volume of data while keeping response times low. Thus, scalability is imperative for similarity search in Web-scale applications, but most existing methods a...


Fast Information-Theoretic Agglomerative Co-clustering

Our algorithm iteratively merges those clusters whose merge yields a lower objective cost. However, operations such as finding nearest neighbors or closest pair of clusters are expensive, especially in high dimensions. To quickly find highly similar clusters to be merged, we exploit the Locality-Sensitive Hashing (LSH) technique, which we briefly describe in this section. Simply put, LSH [2] is...


Bayesian Locality Sensitive Hashing for Fast Similarity Search

Given a collection of objects and an associated similarity measure, the all-pairs similarity search problem asks us to find all pairs of objects with similarity greater than a certain user-specified threshold. Locality-sensitive hashing (LSH) based methods have become a very popular approach for this problem. However, most such methods only use LSH for the first phase of similarity search i.e. ...


Identifying and Indexing Near-Duplicate Images Using Optimizing Technique in Web Search

Today's World Wide Web is growing drastically, and duplicates occur in many fields. Notable among them are duplicate images uploaded to the internet, such as food product images, document images, medical images and textile images. It therefore becomes very important to identify those duplicate images. Near duplicates can be similar copies or differ a little in their visual content. Duplicate images introduce many p...


Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS)

We present the first provably sublinear time hashing algorithm for approximate Maximum Inner Product Search (MIPS). Searching with (un-normalized) inner product as the underlying similarity measure is a known difficult problem and finding hashing schemes for MIPS was considered hard. While the existing Locality Sensitive Hashing (LSH) framework is insufficient for solving MIPS, in this paper we...




Publication date: 2014